Searchable molecular database

ABSTRACT

A computer system comprising a database (100) having a plurality of records is provided. Each record comprises a field point representation representing field extrema for a conformation of a chemical structure. The database may include records for multiple conformations of the same chemical structure. Each record can have a searchable index of the field point representation. In one embodiment the index is a bit string. An indexing mechanism for generating an index, a searching mechanism for searching the database and a graphical user interface to enable a user to interface with the database (100) are also provided.

BACKGROUND OF THE INVENTION

The invention relates to a database of representations of molecules in different conformations which can be searched in order to find molecular conformations with similar field properties, as is useful for drug discovery.

A number of databases exist which allow comparison between structural representations of large numbers of molecules in different conformations [see e.g. references 1, 2]. Databases of this kind are useful for pharmaceutical research, since a known compound with a particular known activity can be used as a search query to identify other compounds with similar molecular structures. These other compounds can then be used as leads and can be studied to establish whether they exhibit similar activity.

One way to compare molecular conformations is to perform atom-atom searching in which each atom and bond of a molecule (including properties such as valence charge) is compared. Many algorithms have been produced to accomplish atom-atom comparison searching. A popular algorithm is that produced by Ullmann, or derivations based upon it. Whilst atom-atom searching is an effective way of comparing molecules, it is computationally intensive and hence slow. Search speeds become unacceptably slow for the average user even when searching across databases containing only a modest number of records.

To speed up the searching process it is conventional to initially perform an index-based search before atom-atom searching, which is then limited to the hits found in the index-based search. An index is a condensed representation of a molecular conformation. A commonly used index type is the bit string (also referred to as a bit map). Bit strings can be rapidly compared using bit-wise operations.

For each molecular conformation an index is created from a definition of the conformation based on its structural properties, such as its atom types and properties of the inter-atomic bonds, such as bond length, angle etc. Two common bit string indexing methods use structural key indexes (also referred to as data dictionary indexes) and fingerprint indexes (also referred to as hashed indexes).

Much work has been carried out on devising less specific representations for molecules. These take features of a molecule and reduce them to character representations, for example aromatic rings (A), linker chains (CH2) (L), electron withdrawing atoms (W), electron donating atoms (D), hydrogen acceptor atoms (HA), and hydrogen donating groups (HD). This allows a complex molecule to be represented by a simple abbreviated reduced molecule. These reduced molecules can be indexed just as if they had full atom representation, and used in search and metric calculations.

Through the use of similarity metrics researchers have devised clustering methods. These include K-Means, Nearest-Neighbour and Jarvis-Patrick algorithms, to name a few. These allow sets of bit strings to be grouped into bins or clusters, indicating that some relationship exists between them. Once clustered the bit strings may be further analysed to search for common bits (features) which tend to predominate in specific groups. These features have then been utilised further in quantitative structure-activity relationship (QSAR) analysis to relate biological activity with bit features. QSAR analysis is a standard term describing the calculation or measurement of one or more properties of a set of molecules and then attempting to relate the biological activities of the molecules to their properties (e.g. by regression).

While index-based searching across molecular databases has proved to be a powerful tool, it has some limitations. In particular, the searching is not generally good at finding new lead compounds which are structurally dissimilar to the search query compound. This is a consequence of the structure-based approach used in existing databases for the indexing. It is therefore desired to create a molecular database with an improved indexing system which is capable of finding lead compounds independent of structural similarity.

SUMMARY OF THE INVENTION

Viewed from a first aspect the present invention provides a computer system comprising a database having a plurality of records, wherein each record comprises a field point representation representing field extrema for a conformation of a chemical structure.

Field point representations are independent of the structural class of a chemical structure. By providing a database with records comprising field point representations, searches can be performed by field point representation rather than chemical structure. Advantageously, searches can identify chemical structures of different structural class to that of a search query. Thus, the database can provide hits which are not obtainable by known chemical structure databases and hits that are likely to have diverse chemical structures.

In a particular embodiment the database includes records for multiple conformations of the same chemical structure. Advantageously, multiple field point representations for the same chemical structure can be searched, increasing the likelihood of the chemical structure being included as a hit in the search results.

In one embodiment an index of the field point representation is associated with each record, the index being a searchable representation of the field point representation.

Preferably the index is a string. Each element of the string may be a binary digit (bit) so that the string is a bit string. Alternatively, the string elements may be more than two-valued, for example they may have values in the range 0 to 3 or 1 to 10. In this case the string elements are referred to as bins. (Use of bits for the string elements can thus be thought of as a special case in which the bin can only adopt two values.) In one embodiment the string elements or bins take real number values (rather than being restricted to integer values). Advantageously, by using a string, known string manipulation techniques can be used.

Multiple indexes of the field point representation may be associated with each record, the multiple indexes being representations of the field point representation at different precision levels. This enables a user to search at different precision levels.

In a preferred embodiment, the index is a string of length n and the computer system comprises an indexing mechanism for generating an index of a field point representation. The indexing mechanism is configured to:

(i) generate a numeric identifier from a characteristic of the field point representation;

(ii) generate one or more numbers in a range from 1 to n (or equivalently 0 to n-1) in dependence on the numeric identifier;

(iii) increment the bins in the string that correspond to the one or more numbers; and

(iv) optionally repeat (i) to (iii) for another characteristic of the field point representation.

Thus, a mechanism for generating a string from a field point representation is provided.

A characteristic of the field point representation may include one or more of:

the number of field points of a particular field of the field point representation;

the particular field and energy of a field point in the field point representation; and

the respective energies of and distance between a field point pairing in the field point representation.

In a preferred embodiment the indexing mechanism is configured to generate one or more numbers in a range from 1 to n in dependence on the numeric identifier by using a deterministic function, such as a pseudo-random number generator or a hash function.

The computer system may also comprise a searching mechanism configured to:

(i) compare a query index with an index of a field point representation for a record in the database;

(ii) identify the record as a hit if the comparison satisfies a search criterion; and

(iii) repeat (i) and (ii) for a plurality of records.

Viewed from another aspect the present invention provides a database for implementation on a computer system, the database configured to support a plurality of records, each record comprising a field point representation representing field extrema for a conformation of a chemical structure.

In another aspect the present invention provides computer software configured to provide the database defined herein and in a further aspect provides a carrier medium carrying the computer software.

Viewed from yet another aspect the present invention provides a method of generating an index of a field point representation representing field extrema for a conformation of a chemical structure, wherein the index is a string with n elements, the method comprising:

(i) generating a numeric identifier from a characteristic of the field point representation;

(ii) generating one or more numbers in a range from 1 to n in dependence on the numeric identifier;

(iii) incrementing the string elements that correspond to the one or more numbers; and

(iv) optionally repeating (i) to (iii) for another characteristic of the field point representation.

In the case that the string is a bit string, the incrementing step will be one of setting the bit to 1 (or the reverse in the case that the bit string is initialised to ones rather than zeroes). On the other hand, when the string elements are many-valued bins, the bin value is incremented until its maximum is reached.
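The two incrementing behaviours can be illustrated with a minimal Python sketch. The function name `increment_elements`, the list-of-integers representation and the `max_value` parameter are assumptions made for illustration only and are not taken from the described implementation.

```python
def increment_elements(index, positions, max_value=1):
    """Increment the elements at the given 1-based positions.

    With max_value=1 the index behaves as a bit string (a bit is simply set);
    with a larger max_value each element is a many-valued bin that saturates.
    """
    for pos in positions:
        i = pos - 1                      # positions run from 1 to n
        if index[i] < max_value:         # stop once the maximum is reached
            index[i] += 1
    return index


# Bit string: position 2 is hit twice but can only be set once.
bits = increment_elements([0] * 8, [2, 5, 2])

# Bins with values 0..3: position 2 is hit four times and saturates at 3.
bins = increment_elements([0] * 8, [2, 5, 2, 2, 2], max_value=3)

print(bits)  # [0, 1, 0, 0, 1, 0, 0, 0]
print(bins)  # [0, 3, 0, 0, 1, 0, 0, 0]
```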

The method may further comprise using a deterministic function to generate one or more numbers in a range from 1 to n in dependence on the numeric identifier.

Viewed from yet another aspect the present invention provides a method of searching a database having a plurality of records, each record comprising a field point representation representing field extrema for a conformation of a chemical structure and having an index of the field point representation, the method comprising:

(i) comparing a query index with an index of a field point representation for a record in the database;

(ii) identifying the record as a hit if the comparison satisfies a search criterion;

(iii) repeating (i) and (ii) for a plurality of records; and

(iv) outputting a representation of the records identified as a hit.
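The search steps (i) to (iv) amount to a linear scan with a pluggable criterion. The following Python sketch is one possible reading: the names `search` and `shares_at_least_two_bits`, the dictionary of records and the use of Python integers as bit-string indexes are all illustrative assumptions, and the criterion shown is deliberately simplistic.

```python
def shares_at_least_two_bits(query_index: int, record_index: int) -> bool:
    """Hypothetical search criterion: at least two set bits in common."""
    return bin(query_index & record_index).count("1") >= 2


def search(query_index, records, criterion):
    """Compare the query index with each record's index and collect hits."""
    hits = []
    for record_id, record_index in records.items():   # step (iii): repeat
        if criterion(query_index, record_index):       # steps (i) and (ii)
            hits.append(record_id)
    return hits                                        # step (iv): output


# Toy database of 8-bit indexes keyed by made-up record identifiers.
records = {"rec1": 0b11100110, "rec2": 0b00011001, "rec3": 0b10100100}
hits = search(0b10100100, records, shares_at_least_two_bits)
print(hits)  # ['rec1', 'rec3']
```

Passing the criterion as a function keeps the scan independent of the similarity measure, so a different comparison could be substituted without changing the loop.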

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the invention and to show how the same may be carried into effect reference is now made by way of example to the accompanying drawings in which:

FIG. 1 is a flow diagram illustrating the steps in the generation of a fieldprint;

FIG. 2 is a flow diagram illustrating the steps performed for fieldprint searching;

FIG. 3 is an overview of the database;

FIG. 4 illustrates the database schema; and

FIG. 5 is a schematic representation of a computer system.

DETAILED DESCRIPTION

The present invention relates to a computer system comprising a database having a plurality of records, wherein each record comprises a field point representation representing field extrema for a conformation of a chemical structure.

The computer system comprises an indexing mechanism for generating a searchable index in the form of a bit string for each field point representation. A bit string is stored in the database for each record.

The computer system also comprises a searching mechanism for searching through the indexes stored in the database to identify field point representations that match the field point representation of a search query. Known searching algorithms can be used.

A suitable user interface, for example a graphical user interface (GUI), is provided to enable a user to interface with the database. A user can use the user interface to input data to and output data from the database, to search the database and to browse the database.

The following sections describe in more detail the field point representations, the generation of indexes and searching the database. After these sections an overview of a particular embodiment of a database is given, followed by a detailed description of the database structure of the particular embodiment and a description of a computer system.

I. Field Point Representations

It is possible to predict the binding properties of a candidate molecule, or other chemical structure, by representing the physical properties of a molecule which are important in its binding to other molecules, and then assessing the similarity between two such sets of physical properties, one for the candidate molecule and one for a well characterised molecule.

Accurate molecular modelling is possible using advanced quantum mechanics. However, the computational effort needed for quantum mechanics is prohibitive for most biologically relevant molecules.

An alternative approach is called molecular mechanics. The most common way of implementing molecular mechanics in three dimensions is to calculate and compare fields around a molecule, such as the steric (van der Waals) and electrostatic (Coulombic) fields. The principles of molecular mechanics are simple and empirical. Moreover, molecular mechanics is computationally fast enough to cope with large proteins and other biopolymers associated with drug design.

In molecular mechanics electrostatic properties of a molecule are defined by placing a point charge at the centre of each atom (atom-centred charges or ACCs). Many different methods for calculating or estimating the value of such point charges are described in the literature. The aim of ACC methods is to distribute the point charges in such a way that the resulting electrostatic field is as similar as possible to the true electrostatic field (as determined by quantum mechanics methods). The electrostatic field as approximated by ACCs is usually quite accurate at a distance from the molecule (>5 Å), but can be quite inaccurate at the molecular surface.

To improve the quality of molecular mechanics models at the molecular surface, extended electron distributions (XEDs) have been developed. The XED method involves replacing the point charge at the centre of some atoms with a set of point charges, one at the centre of the atom and one or more others distributed around that atom a short distance away. The XED method is described in Vinter (1994) [5] and Vinter and Trollope (1995) [6]. In the XED method, the XEDs themselves are treated simply as extra atoms which have charge but no volume. XED methods can therefore calculate electrostatic interactions more accurately than ACC methods, while retaining the speed advantages of the molecular mechanics framework.

Quantum mechanical models and molecular mechanical models, such as ACC or XED models, can use the concept of field points to represent the molecular field. In this approach, the conformation of a molecule, i.e. its equilibrium arrangement either in isolation or when bound to another specific molecule or surface, is represented by a set of field points which measure field strength at a relatively small number of field maxima and minima around the molecule which are relevant to how the molecule is likely to interact with other molecules.

In order to calculate field points, a field definition must be adopted. One known field definition for molecular mechanical models uses positive and negative electrostatic interaction fields in combination with a surface interaction field. The two electrostatic interaction fields are defined by the interaction energy of a specific charged ‘probe’ molecule with the molecule of interest. For example, a probe the size of an oxygen atom, with either a +1 or a −1 unit charge, can be used. The field value at a given point is the interaction energy of the molecule with the probe atom sited with its centre at that point. The surface interaction field is defined by the van der Waals interaction energy of a neutral ‘probe’ with the molecule, for example an uncharged oxygen atom.

Other field definitions have been used, for example ones that include electrostatic fields calculated from quantum mechanical methods, and ones that include hydrophobic fields calculated from the electrostatic field and its partial derivatives. In principle, any field definition can be used provided that its value can be defined at any point in space around the molecule.

Once the field definition has been made, the field points of the molecule need to be calculated. With the molecular modelling approach, the field points are subdivided into a number of subsets, one for each field type, with each subset being calculated separately. The field points for a molecule are the values and locations of the extrema of its field, i.e. local maxima and minima. The final set of field points from each field type can be filtered to remove duplicate extrema and extrema with small energy values if desired.

The field point set encodes a large amount of information about the properties of the molecule, especially regarding its interaction with other molecules. The electrostatic field points encode information about the preferred hydrogen-bonding environment of the molecule, while the surface interaction field points encode the molecule's steric bulk.

The basic assumption underlying the field point approach is that two molecules which have similar sets of field points should have similar interactions with other molecules and hence should have similar biological activities. In other words, if molecule A has a certain biological activity, and molecule B is calculated to be similar to molecule A in a relevant conformation, then it is concluded that molecule B potentially has the same biological activity.

A field point representation therefore represents field extrema for a conformation of a chemical structure. Typically a field point representation includes a set of field points where each field point has a position and a field size value.

A field point representation may represent field extrema for a plurality of fields. In the example used herein the field point representation represents four fields, namely positive and negative electrostatic interaction fields, a surface interaction (i.e. steric) field, and a scaffold field.
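As a rough illustration of what such a representation might look like in memory, the following Python sketch models a field point as a field type, a position and an energy, and a representation as a list of such points for one conformation. The class names, attribute names and the numerical values are hypothetical and chosen only for illustration; the database schema of the embodiment is described separately.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The four field types used in the example embodiment.
FIELD_TYPES = ("positive", "negative", "surface", "scaffold")


@dataclass(frozen=True)
class FieldPoint:
    field_type: str                        # one of FIELD_TYPES
    position: Tuple[float, float, float]   # X, Y, Z coordinates
    energy: float                          # field size value at the extremum


@dataclass
class FieldPointRepresentation:
    structure_id: str                      # hypothetical identifier
    conformation_id: int                   # several conformations per structure
    points: List[FieldPoint]

    def count(self, field_type: str) -> int:
        """Number of field points of one field type."""
        return sum(1 for p in self.points if p.field_type == field_type)


# Arbitrary made-up values, not real field data.
rep = FieldPointRepresentation(
    structure_id="example-structure",
    conformation_id=1,
    points=[
        FieldPoint("negative", (1.0, 0.0, 2.5), -12.3),
        FieldPoint("negative", (-2.0, 1.5, 0.0), -8.1),
        FieldPoint("surface", (0.0, 3.0, 1.0), -1.7),
    ],
)
print(rep.count("negative"))  # 2
```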

Field point representations can be compared directly. For example, the similarity between conformations of two molecules can be calculated according to a scoring formula which is sensitive to differences between the field point positions and energy values of the field points in the two field point sets.

However, it is desirable to generate a searchable index of a field point representation so that indexes can be stored in the database and searched upon to perform a screen out before further comparisons of the search results are performed, if required. Generating searchable indexes of a field point representation is non-trivial.

Field point representations are also referred to as field patterns herein and the terms can be used interchangeably.

II. Index Generation

A searchable index of the field point representation is created in the form of a fingerprint-type bit string.

A fingerprint is generated from the molecule using a fingerprinting algorithm that examines the molecule and generates a pattern. Typical examples that are used include a pattern for each atom; a pattern for each atom and its nearest neighbour plus the joining bond; a pattern for each atom, its nearest neighbour, joining bond and further neighbours and bonds for varying path lengths; and a pattern for augmented atoms. The list of patterns produced is exhaustive, such that every pattern in the molecule up to the specified path length limit is generated. Each pattern serves as a seed to a pseudo-random number generator (i.e. it is hashed). The output of the pseudo-random number generator is a set of bits (typically 4 or 5 per pattern) which is added to the fingerprint with a logical OR. The creation of the seed is coded so as to produce a unique value for the pattern and hence the random number generation. Because each set of bits is produced by a pseudo-random number generator, it is likely that some bits will overlap. However, by setting 4 or 5 bits per pattern the probability that keys will be identical is reduced to an insignificant level for screen out purposes. The size of the bit string may be set independently since, unlike keys, a bit does not have an exact meaning in the fingerprint. A bit string size of 2K (2048 bits) is commonly used as a compromise between speed and overlap. However, other fingerprint sizes such as 1K, 4K and 8K could be used.

Fingerprints have the important property that, if a pattern is a substructure of a molecule, every bit in the pattern's bit string will be set in the molecule's bit string. This means that simple boolean or bit-wise operations can be used. Each bit of a fingerprint can be thought of as being shared among an unknown but large number of patterns. Each pattern generates its particular set of bits. So long as at least one of those bits is unique, it can be established if the pattern is present or not. If a fingerprint indicates a pattern is missing then it certainly is, but it can only indicate a pattern's presence with some probability. Since fingerprints have no predefined set of patterns, one fingerprinting system can be used to serve all databases and all types of queries.

Although not used in the current implementation, the fingerprint may be folded. Folding is a term used to describe a process whereby a fingerprint is halved in size by performing a logical OR of its two halves. The result is a shorter fingerprint with a higher bit density. One can continue to fold until the desired bit density is achieved. With each fold one increases the chances of a false positive but one saves half the space required to store the fingerprint. Since one can only compare fingerprints of the same length, some work must be done when querying to ensure there are bit strings of suitable length available for comparison.
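Folding can be sketched in a few lines of Python, holding the fingerprint as an integer. The function name `fold` and the integer representation are assumptions for illustration; as noted, folding is not used in the current implementation.

```python
def fold(fingerprint: int, length: int):
    """Halve a fingerprint of even bit length by OR-ing its two halves.

    Returns the folded fingerprint and its new length.
    """
    if length % 2:
        raise ValueError("length must be even to fold")
    half = length // 2
    low = fingerprint & ((1 << half) - 1)   # lower half of the bits
    high = fingerprint >> half              # upper half of the bits
    return low | high, half


folded, new_length = fold(0b10100100, 8)
print(format(folded, "04b"), new_length)  # 1110 4
```

In this toy case the 8-bit string has 3 of 8 bits set, while the folded 4-bit string has 3 of 4 bits set, which shows the increase in bit density.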

Bit string theory is described in Mooers (1951 and 1956) [3, 4]. The basic principles that can be used and some advanced techniques which may be applied to bit strings will now be described.

Bit strings are arrays of bits, each of which is set to either zero or one (False or True). The length of the bit strings can vary depending on the type of index being created.

When the presence of a substructure is tested the bit strings are compared using a logical AND. For example, consider the following two 8-bit bit strings A and B.

A: 10100100

B: 11100110

Imagine both have been created using the same indexing method for the characterisation of a molecule B and a substructure query A.

One can test to see if the substructure is likely to exist in the main molecule by testing whether the following equation is true or false:

B & A = A

where & is logical AND.

For the example above a true result is produced; however, if A is replaced with 10010100 a false result is produced. One would then know for certain that the substructure does not exist and should not waste time analysing the molecule further.

An exact match can be tested for by additionally requiring B & A = B.
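The screen can be checked directly with the two example bit strings. The sketch below is illustrative only; the function name `may_contain` is an assumption, and the bit strings are held as Python integers.

```python
def may_contain(molecule_bits: int, query_bits: int) -> bool:
    """True if every bit set in the query is also set in the molecule."""
    return (molecule_bits & query_bits) == query_bits


A = 0b10100100   # substructure query
B = 0b11100110   # molecule

print(may_contain(B, A))            # True: the molecule passes the screen
print(may_contain(B, 0b10010100))   # False: the substructure is certainly absent
```

A True result only means the molecule survives the screen and may contain the substructure; a False result is definitive.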

The present system implementation allows bit strings to be compared for similarity using Tanimoto coefficient, Euclidean distance or Tversky similarity comparison techniques, each of which is now briefly described. Other bit-string comparison algorithms could also be provided. In one embodiment bit strings are compared for similarity using the Kulczynski metric.

The Tanimoto coefficient can be described as the number of bits in common between two bit strings divided by the total number of bits set in either bit string. This is an intuitive similarity measure as it is normalised to account for the number of bits that are in common relative to the number that might be in common. The equation can only be used as a similarity metric.

For two bit strings A and B their Tanimoto similarity is given by the equation

TS = BCm / (BCa + BCb − BCm)

where

BCm is the number of bits set to 1 in common between the two bit strings

BCa is the count of bits set to 1 in bit string A

BCb is the count of bits set to 1 in bit string B

The results from this comparison range between 0 and 1, with 0 being the least similar and 1 being the most similar.
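As a worked illustration, the Tanimoto similarity of two bit strings held as Python integers can be computed as follows. The function name `tanimoto` and the convention of returning 1.0 for two all-zero strings are assumptions of this sketch.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: common set bits over bits set in either string."""
    bc_m = bin(a & b).count("1")    # bits set to 1 in common
    bc_a = bin(a).count("1")        # bits set to 1 in A
    bc_b = bin(b).count("1")        # bits set to 1 in B
    denominator = bc_a + bc_b - bc_m
    # Assumed convention: two empty bit strings are treated as identical.
    return bc_m / denominator if denominator else 1.0


A = 0b10100100   # 3 bits set
B = 0b11100110   # 5 bits set, 3 of them shared with A
print(tanimoto(A, B))  # 0.6
```

Here BCm = 3, BCa = 3 and BCb = 5, giving 3 / (3 + 5 − 3) = 0.6.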

Euclidean distance is a measure of the geometric distance between two fingerprints, where each is thought of as a vector in multi-dimensional space. It can be used as a measure of similarity and as a substructure search metric depending on how it is applied.

Tversky similarity provides a more powerful metric. Like the Tanimoto metric, Tversky compares the features in a query bit string to features in the given (database) bit string. However, Tversky allows one to specify the weighting that will be given to each set of features. This allows the Tversky metric to be used in similarity, substructure and superstructure searching. The basic weightings are usually between 0 and 1 (0-100%) giving a ratio model. However, the equation can be modified to accept weightings greater than 100%, thus providing a contrast model which causes distinguishing features to be emphasised more than the common features, which may be more useful in diversity or dissimilarity metrics.

For two bit strings A and B their Tversky similarity is given by the equation

TvS = BCm / (αBCa + βBCb − BCm)

where

BCm is the number of bits set to 1 in common between the two bit strings

BCa is the count of bits set to 1 in bit string A

BCb is the count of bits set to 1 in bit string B

α is the weighting to be given to bit string A

β is the weighting to be given to bit string B

If both weightings are set to 100% (i.e. 1) then the Tversky equation gives the same results as the Tanimoto similarity. By varying the weightings the user can adjust how the bit strings are compared in terms of sub or super pattern similarity between the two bit strings.
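The effect of the weightings can be seen in a short Python sketch. Note the assumption: the code uses the commonly cited set-based form of the Tversky index, in which α and β weight the bits set only in A and only in B respectively. This may differ in detail from the equation as written in this description, but it likewise reduces to the Tanimoto value when both weightings are 1. The function name `tversky` is illustrative.

```python
def tversky(a: int, b: int, alpha: float, beta: float) -> float:
    """Set-based Tversky index for bit strings held as integers (assumed form)."""
    common = bin(a & b).count("1")     # bits set in both
    only_a = bin(a & ~b).count("1")    # bits set in A but not in B
    only_b = bin(b & ~a).count("1")    # bits set in B but not in A
    denominator = alpha * only_a + beta * only_b + common
    return common / denominator if denominator else 1.0


A = 0b10100100   # query
B = 0b11100110   # database bit string

print(tversky(A, B, 1.0, 1.0))  # 0.6, same as Tanimoto
print(tversky(A, B, 1.0, 0.0))  # 1.0, all query bits are present in B
```

With α = 1 and β = 0 the extra bits of B are ignored, so the score reaches 1.0 whenever every bit of A is set in B, which is the behaviour wanted for substructure-style searching.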

Instead of the fingerprint bit string indexes used in the current implementation, data dictionary bit string indexes could be used.

Data dictionary indexes are also known as structural keys. A structural key is represented as a boolean array in which each element is true or false. Boolean arrays in turn are represented as bit strings in which each bit represents one position of the boolean array. A structural key is a bit string in which each bit represents the presence (true) or absence (false) of a specific structural feature (pattern). A fragment library is created of the patterns that are considered important, each pattern being assigned to a bit of the bit string. The number of fragments in the library dictates the bit string length. The bit string for a molecule is created by carrying out a substructure search of each structure or pattern in the fragment library and setting its corresponding bit in the bit string appropriately. Depending on the number of fragments in the library this can be a time consuming process. When a database is searched for a particular structural feature, a search key is generated. As the search proceeds, the search key is compared to the bit string of each molecule in the database. If a TRUE bit in the search key is not also set as TRUE in the molecule's key, then the structural feature represented by that bit is not in the molecule, so the molecule can be excluded from consideration.

Structural keys, like fingerprints, have the important property that, if a pattern is a substructure of a molecule, every bit in the pattern's bit string will be set in the molecule's bit string, thus allowing boolean or bit-wise operations to be used to compare bit strings.

Using bit strings as indexes allows rapid bitwise comparison using simple AND, OR, XOR and NOT computer operations. They are also particularly suitable to use in similarity measures based on the numerous similarity formulae that exist. The method by which data is encoded into a bit string is known as fingerprinting. Whilst the use of fingerprinting and bit strings is known, the approach has never been applied to field point representations. In other words, generating bit strings from field point representations is new.

In one embodiment an indexing mechanism is used to generate an index of a field point representation. The indexing mechanism may be implemented on a computer system as software, firmware or hardware, although in a particular embodiment it is implemented as software.

In a particular embodiment the index is a bit string of length n and the indexing mechanism is configured to:

(i) generate a numeric identifier from a characteristic of the field point representation;

(ii) generate one or more numbers in a range from 1 to n in dependence on the numeric identifier;

(iii) set the bits in the bit string that correspond to the one or more numbers; and

(iv) optionally repeat (i) to (iii) for another characteristic of the field point representation.

Thus, starting with a bit string of length n with all n bits set to zero (or indeed with all n bits set to 1), bits of the bit string can be set in dependence on one or more characteristics of the field point representation. Suitably, one or more characteristics are identified, one or more numeric identifiers are generated, and one or more numbers between 1 and n are generated. These features will now be described.

II.A. Characteristics

The characteristic of the field point representation can be any property and/or relationship that exists within the data.

The properties that can exist in a field point representation include the field type of each field point (for example negative, positive, surface, scaffold); the size or energy of each field point; the total number of field points; the number of each type of field point; and the X, Y, Z coordinates of a field point.

Relationships which can be derived from the properties include the pairwise distance relationship between two field points; the angles between three field points; the triangulation distances between three field points; and any other relationship of interest between two or more field points.
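Two of these relationships, pairwise distances and the angle defined by three field points, can be derived from the coordinates alone. The following Python sketch (requiring Python 3.8 or later for `math.dist`) is illustrative; the function names and the sample coordinates are assumptions and are not taken from the described implementation.

```python
import math
from itertools import combinations


def pairwise_distances(positions):
    """Distance between every pair of field point positions, keyed by index pair."""
    return {
        (i, j): math.dist(positions[i], positions[j])
        for i, j in combinations(range(len(positions)), 2)
    }


def angle_at(vertex, p, q):
    """Angle in degrees at `vertex` between the directions to p and q."""
    u = [a - b for a, b in zip(p, vertex)]
    v = [a - b for a, b in zip(q, vertex)]
    cos_theta = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


# Made-up X, Y, Z coordinates for three field points.
positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
distances = pairwise_distances(positions)
angle = angle_at(positions[0], positions[1], positions[2])
print(distances[(0, 1)], distances[(0, 2)])  # 5.0 5.0
```

The three pairwise distances of a triple are the triangulation distances mentioned above, so both kinds of relationship can be obtained from the same dictionary.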

Any or all of the properties and relationships may be used by the indexing mechanism or a fingerprinting algorithm to generate an index (fingerprint) from a given field point representation (field pattern).

In one embodiment a characteristic of the field point representation includes one or more of:

the number of field points of a particular field of the field point representation;

the particular field and energy of a field point in the field point representation; and

the respective energies of and distance between a field point pairing in the field point representation.

A characteristic of the field point representation is used to generate a numeric identifier which in turn is used to generate one or more numbers between 1 and n for setting bits in the bit string. In order to understand the generation of the numeric identifier from a field point representation, the generation of one or more numbers between 1 and n in dependence on the numeric identifier will first be described.

II.B. Generation of Numbers Between 1 and n

In one embodiment the indexing mechanism is configured to generate one or more numbers in a range from 1 to n in dependence on the numeric identifier by using a deterministic function.

A deterministic function is a function which takes a value as an input value or seed and generates one or more output values in dependence on the input value such that the one or more output values for any given input value are always the same.

For example, if a deterministic function is seeded with the number 27 to produce four output values, it may output the values 0.23, 0.33, 0.21 and 0.88. If the same function is subsequently seeded with the number 27, then it will output the same four values, namely 0.23, 0.33, 0.21 and 0.88.

Deterministic functions can be used to generate one or more integer output values between 1 and a number n, by converting the output values to integers in this range. This can be done by scaling and rounding the output values.

For example, certain deterministic functions generate output values that all lie between 0 and 1. These can be scaled to an integer value between 1 and n by using the formula:

integer value = ROUND(output value * (n - 1) + 1)
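The scaling formula can be illustrated with a short Python sketch. The function name `scale_to_index` is hypothetical, and ROUND is assumed here to round halves upwards, which the text does not specify.

```python
import math


def scale_to_index(output_value: float, n: int) -> int:
    """Map an output value in [0, 1] to an integer in 1..n.

    Implements: integer value = ROUND(output value * (n - 1) + 1).
    Assumption: ROUND rounds halves upwards.
    """
    if not 0.0 <= output_value <= 1.0:
        raise ValueError("output value must lie between 0 and 1")
    return math.floor(output_value * (n - 1) + 1 + 0.5)


# The four example output values for seed 27, scaled for a 2048-bit string.
positions = [scale_to_index(v, 2048) for v in (0.23, 0.33, 0.21, 0.88)]
```

An output value of 0 maps to 1 and an output value of 1 maps to n, so the whole bit string is reachable.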

An integer value generated in this way can be used to set a corresponding bit in a bit string. If, for example, the deterministic function is seeded to produce four output values from one seed (input value) then four integer values can be generated and used to set four bits in the bit string.

Examples of deterministic functions are hashing algorithms and pseudo random number generators. The current system implementation uses a pseudo random number generator.

In one embodiment known length bit strings are used. Starting with a bit string containing only a series of 0's, the basis of the approach is to create a unique identifier (number) for each and every property or relationship contained within the field pattern. The unique identifier is used as a seed to initialise a random number generator. The random number generator is used to provide a series of numbers (commonly 4 numbers) between 1 and the length of the bit string. The numbers produced are used to set the corresponding bits in the bit string to 1. After cycling around all the properties or relationships that are to be analysed, the bit string will contain a series of 0's and 1's which are unique to that field pattern.

An important part of creating any bit string index is to create the unique identifier for a defined property or relationship. Once created, the unique identifier will always produce the same sequence from a deterministic function.
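As a minimal sketch of this seeding step, the following Python code uses the standard library generator `random.Random` as the pseudo random number generator. The function names and the identifier values 27, 1001 and 5 are hypothetical, chosen only for illustration.

```python
import random


def bit_positions(identifier: int, n: int, count: int = 4) -> list[int]:
    """Return `count` numbers between 1 and n derived from the identifier."""
    rng = random.Random(identifier)  # seeding makes the sequence repeatable
    return [rng.randint(1, n) for _ in range(count)]


def set_bits(bit_string: list[int], identifier: int) -> None:
    """Set the bits selected by the identifier (position 1 is list index 0)."""
    for position in bit_positions(identifier, len(bit_string)):
        bit_string[position - 1] = 1


bit_string = [0] * 2048           # start with a series of 0's
for identifier in (27, 1001, 5):  # one hypothetical identifier per property
    set_bits(bit_string, identifier)
```

Because the generator is re-seeded for every identifier, the same property always sets the same bits, regardless of the order in which properties are processed.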

II.C. Generation of the Numeric Identifier

The indexing mechanism can be configured to take a measurement of a characteristic to generate the numeric identifier.

In a particular example for generating a bit string (including the generation of the numbers in a range from 1 to n), the indexing mechanism uses the fingerprinting algorithm detailed below in pseudo code. The code is applied to each field point representation (field pattern) being stored in the database, giving an index (fingerprint) for each record.

The code is exemplified using a bit string length of 2048; however, bit strings of any appropriate length can be used.

1. A bit string of length 2048 is created consisting entirely of 0's (zeros).

2. For each field type (negative, positive, surface, scaffold):

    a. Count the number of field points of that type in the pattern.
    b. Encode the field type and the field point count into a preferably unique numeric identifier.
    c. Seed a pseudo random number generator with the numeric identifier.
    d. Obtain four numbers from the pseudo random number generator between 0 and 2047 (to span a range from 1 to 2048) and use them to set the corresponding bits in the bit string to 1.

3. For each field point in the pattern:

    a. Encode the field type and the field point energy into a preferably unique numeric identifier.
    b. Seed a pseudo random number generator with the numeric identifier.
    c. Obtain four numbers from the pseudo random number generator between 0 and 2047 and use them to set the corresponding bits in the bit string to 1.

4. For each field point pairing in the field pattern:

    a. Calculate the distance (to a given precision) between the two points from their X, Y, Z coordinates.
    b. Encode the two field types and the distance between them into a unique numeric identifier.
    c. Seed a pseudo random number generator with the numeric identifier.
    d. Obtain four numbers from the pseudo random number generator between 0 and 2047 and use them to set the corresponding bits in the bit string to 1.
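Steps 1 to 4 can be sketched in Python as follows. This is an illustrative sketch only: the `FieldPoint` type, the way identifiers are packed into integers (a category prefix plus type code plus measured value), and the rounding of energies to the nearest unit are all assumptions, since the pseudo code does not fix an encoding.

```python
import math
import random
from itertools import combinations
from typing import NamedTuple

FIELD_TYPES = ("negative", "positive", "surface", "scaffold")
N_BITS = 2048


class FieldPoint(NamedTuple):
    field_type: str
    energy: float
    x: float
    y: float
    z: float


def set_bits(bits: list[int], identifier: int) -> None:
    """Seed a PRNG with the identifier and set four bits."""
    rng = random.Random(identifier)
    for _ in range(4):
        bits[rng.randrange(N_BITS)] = 1  # randrange yields 0..2047


def fingerprint(points: list[FieldPoint], precision: float = 1.0) -> list[int]:
    bits = [0] * N_BITS                                   # step 1
    code = {t: i for i, t in enumerate(FIELD_TYPES)}
    for t in FIELD_TYPES:                                 # step 2
        count = sum(1 for p in points if p.field_type == t)
        set_bits(bits, 1_000_000_000 + code[t] * 1_000_000 + count)
    for p in points:                                      # step 3
        energy_bin = round(abs(p.energy))  # assumed: magnitude, nearest unit
        set_bits(bits, 2_000_000_000 + code[p.field_type] * 1_000_000 + energy_bin)
    for a, b in combinations(points, 2):                  # step 4
        dist_bin = round(math.dist(a[2:], b[2:]) / precision)
        lo, hi = sorted((code[a.field_type], code[b.field_type]))
        set_bits(bits, 3_000_000_000 + (lo * 4 + hi) * 1_000_000 + dist_bin)
    return bits


# Hypothetical three-point field pattern.
pattern = [
    FieldPoint("negative", 6.23, 0.0, 0.0, 0.0),
    FieldPoint("positive", 2.09, 3.0, 4.0, 0.0),
    FieldPoint("surface", 1.50, 0.0, 0.0, 2.5),
]
fp = fingerprint(pattern)
```

Sorting the two type codes in step 4 makes the pair identifier independent of the order in which the two field points are visited.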

FIG. 1 illustrates a fingerprint generation method. It is noted that the flow diagram refers to bins rather than bits. However, the bins in this embodiment can only adopt values of 0 or 1, so that bin and bit are synonymous. In the more general case where each bin can adopt an arbitrary number of values, the step of “Set all bins to 0” will be the same, but the step of “Set corresponding bins to 1” will become one of incrementing the bin values.

The resulting fingerprint bit string contains a series of 1's and 0's which encodes the nature of the field pattern. The fingerprint generated is then stored in the database.

In step 4 it is possible to alter the precision at which the distance between two field points is measured. In the current example four precision levels (1, 0.5, 0.25 and 0.1 Angstroms) are used.

This means that for each field pattern registered to the database four fingerprints are generated and stored in the database. This allows searches to be carried out over the database at different precision levels. Thus it will be appreciated that in one embodiment the indexing mechanism is configured to take a measurement of a characteristic at different levels of precision to generate corresponding multiple indexes which represent the field point representation at different precision levels.

Other methods can be used to encode the field pattern into a bit string. For instance three field point comparisons (triangles) may be used rather than the two field point comparison detailed above. In this case the same procedures as outlined above can be used, except that in step 4 the information for each three field point grouping would be encoded.

In another embodiment the indexing mechanism is configured to:

(i) define a plurality of ranges of possible measurement values;

(ii) take a measurement of a characteristic of the field point representation to produce a measurement value;

(iii) assign the measurement value to a range if the measurement value is within the range;

(iv) optionally repeat (ii) and (iii); and

(v) use the number of measurement values assigned to a range to generate the numeric identifier.

In a particular example which uses a definition of a plurality of ranges, a numeric identifier is generated for each field point pair and used as a ‘seed’ for a pseudo-random number generator. Measurements are taken of the following characteristics:

-   the field type (one of four) for each field point
-   the field energy for each field point
-   the distance between the field points

There are 10 possibilities for the pair of field types, since there are 10 possible unordered combinations of 4 field types (4 pairs of the same type plus 6 pairs of different types), and these can therefore be encoded into a number between 1 and 10.
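The count of 10 can be checked by enumerating the unordered pairs. In this sketch the numbering of the pairs from 1 to 10 simply follows the enumeration order, which is an arbitrary choice, and the function name `field_types_code` is hypothetical.

```python
from itertools import combinations_with_replacement

FIELD_TYPES = ("negative", "positive", "surface", "scaffold")

# Enumerate the unordered pairs of field types and number them 1..10.
PAIR_CODES = {
    pair: number
    for number, pair in enumerate(
        combinations_with_replacement(FIELD_TYPES, 2), start=1
    )
}


def field_types_code(type_a: str, type_b: str) -> int:
    """Return the 1..10 code for a pair of field types, ignoring order."""
    key = tuple(sorted((type_a, type_b), key=FIELD_TYPES.index))
    return PAIR_CODES[key]
```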

Ranges with a width that can be considered as an ‘energy precision parameter’ are defined for the energies. These ranges are used to convert each field point energy (measurement value) into an integer. For example:

-   0-5 becomes 1
-   5-10 becomes 2
-   10-15 becomes 3

and so on.

The energy precision parameter determines the width of the ranges, which in the example above is 5.0. This means that field points with energy values between 0 and 5 are considered to be the ‘same’, those between 5 and 10 are the ‘same’ and so on.
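A minimal sketch of this conversion is given below. The text does not say which range a boundary value such as exactly 5 belongs to; the sketch assumes each range includes its lower bound and excludes its upper bound. The function name `energy_bin` is hypothetical.

```python
import math


def energy_bin(energy: float, precision: float = 5.0) -> int:
    """Convert a non-negative field point energy into an integer range number.

    Assumption: ranges are half-open, so with precision 5.0 an energy of
    exactly 5 falls in range 2.
    """
    if energy < 0:
        raise ValueError("this sketch only handles non-negative energies")
    return math.floor(energy / precision) + 1
```

With the default width of 5.0, energies of 1.0 and 4.9 receive the same integer and are therefore treated as the ‘same’.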

The field point pair distance needs to be similarly encoded. Suitably, each possible distance is assigned an integer, such that if two distances are to be considered the ‘same’ then the integer assigned to them should be the same.

One method uses a constant distance resolution or precision level, so:

-   0-1 becomes 1
-   1-2 becomes 2
-   2-3 becomes 3

and so on. This example has a distance resolution of 1, as all distances are rounded up to the nearest 1 Angstrom.

One example uses 4 ‘precision levels’ which correspond to different distance resolutions. In the example the 4 distance resolutions are 0.25, 0.5, 1.0 and 2.0. At 0.25, for example, the mapping is such that:

-   0-0.25 becomes 1
-   0.25-0.5 becomes 2
-   0.5-0.75 becomes 3

and so forth.
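The constant-resolution mapping amounts to rounding the distance up to the next multiple of the resolution. The sketch below applies it at the four example resolutions; the function names are hypothetical, and mapping a distance of exactly 0 to integer 1 is an assumption.

```python
import math

PRECISION_LEVELS = (0.25, 0.5, 1.0, 2.0)  # distance resolutions in Angstroms


def distance_bin(distance: float, resolution: float) -> int:
    """Round a distance up to the next multiple of the resolution.

    Assumption: a distance of exactly 0 is placed in the first range.
    """
    return max(1, math.ceil(distance / resolution))


def bins_at_all_levels(distance: float) -> dict[float, int]:
    """Return the integer assigned to a distance at each precision level."""
    return {res: distance_bin(distance, res) for res in PRECISION_LEVELS}
```

Two distances that share an integer at a coarse resolution may be separated at a finer one, which is what allows searching at different precision levels.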

In another example a lookup table is used to define the ranges and map the distances to integers. This removes the constraint that the distance resolution needs to be the same at all distances. For example, higher resolutions can be used at short distances, while lower resolutions can be used at long distances. In an example the mapping is such that:

-   0-0.1 becomes 1
-   0.1-0.2 becomes 2
-   0.2-0.4 becomes 3
-   0.4-0.7 becomes 4
-   0.7-1.0 becomes 5
-   1.0-2.0 becomes 6
-   2.0-5.0 becomes 7
-   5.0-10.0 becomes 8
-   10.0-20.0 becomes 9
-   >20.0 becomes 10

Thus in this example any distance is mapped to a number from 1 to 10 and distances of 0.23 and 0.53 are seen as ‘different’, but distances of 11.0 and 17.0 are the ‘same’, for example.
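The lookup table can be sketched as a sorted tuple of upper bounds searched with the standard library `bisect` module. Treating each upper bound as belonging to the lower range (so that exactly 0.1 maps to 1) is an assumption, as the text leaves boundary values open; the function name `distance_code` is hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the first nine ranges; anything above 20.0 maps to 10.
UPPER_BOUNDS = (0.1, 0.2, 0.4, 0.7, 1.0, 2.0, 5.0, 10.0, 20.0)


def distance_code(distance: float) -> int:
    """Map a distance to an integer from 1 to 10 using the lookup table."""
    return bisect_left(UPPER_BOUNDS, distance) + 1
```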

Once four integers for the field point pair have been generated (one representing the field types, two representing the field sizes, i.e. the binned field point energies, and one representing the distance between the field points), these can be combined into a single integer for the field point pair.

For example, if the field types integer can be 1-10, the size values can be 1-10, and the distance value can be 1-100, then

K = (distance value)*1000 + (size value 1)*100 + (size value 2)*10 + (types value)

encodes these four numbers into one number K in such a way that each value of K uniquely maps to a (dist, size1, size2, types) set. This number K is the numeric identifier which is then used as the seed to the hash function or pseudo random number generator which is used to set one or more bits in the bit string.
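The uniqueness of K can be made concrete with an encode and decode pair. Because every component starts at 1 rather than 0, subtracting 1111 (that is, 1000 + 100 + 10 + 1) leaves zero-based digits that can be separated again. The function names are hypothetical.

```python
def encode_pair(dist: int, size1: int, size2: int, types: int) -> int:
    """Combine the four integers for a field point pair into K."""
    if not 1 <= dist <= 100:
        raise ValueError("distance value must be 1-100")
    if not all(1 <= v <= 10 for v in (size1, size2, types)):
        raise ValueError("size and types values must be 1-10")
    return dist * 1000 + size1 * 100 + size2 * 10 + types


def decode_pair(k: int) -> tuple[int, int, int, int]:
    """Recover (dist, size1, size2, types) from K."""
    rest = k - 1111                  # shift every component to start at 0
    rest, types = divmod(rest, 10)
    rest, size2 = divmod(rest, 10)
    dist, size1 = divmod(rest, 10)
    return dist + 1, size1 + 1, size2 + 1, types + 1
```

Since decoding recovers the original four values for every allowed input, no two distinct (dist, size1, size2, types) sets share a value of K.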

Thus it will be appreciated that, using the above, the indexing mechanism can be configured to define ranges of equal width across all ranges or to define a range for smaller measurement values with a narrower width than a range for larger measurement values. In a particular embodiment the indexing mechanism is configured to generate multiple indexes by defining ranges of different widths for different precision levels.

In a further example a numeric identifier is generated for each field point pair as follows. Measurements are taken of characteristics which do not include the field energy for the field points to generate the numeric identifier. In the example the following measurements are taken to generate the numeric identifier:

-   the field type (one of four) for each field point
-   the distance between the field points.

As in the earlier example, the two field types can be encoded into a number between 1 and 10. This number is used together with the distance value to obtain the numeric identifier.

For example the number between 1 and 10 can be added to the distance (rounded to an integer value) or an explicit mapping can be used. The explicit mapping could map all field point pairs of a first field type and a second field type in a certain distance range to a particular value. For example a positive and a negative field point between 4 Angstroms and 10 Angstroms apart (e.g. type negative, type positive, distance 6.7 Angstroms apart) could be mapped to a numeric identifier of 47.

For a bit string of length n, this numeric identifier can be used to generate a single number in the range of 1 to n, for example by using a simple one-to-one mapping. For instance, numeric identifier 47 can be used to generate, or be mapped to, the number 47 (i.e. element 47 in the string).

In this example the values in the string can take real number values (rather than being restricted to integer values). A measurement of the field energy for each of the field points in the field point pair is taken and the values are converted to a real number. This can be done by calculating the product or the sum of the two measurements. For example, if the type negative field point is size 6.23 and the type positive field point is size 2.09, then using the product the real number value (6.23×2.09) is calculated, whereas using the sum gives a real value (6.23+2.09).

The resulting real number value is added to the respective element of the string (element 47 in this example).

Using this approach, each position in the string (which can also be considered a vector) has a one-to-one correspondence with a “type” of field pair. For example element 47 in the string may be uniquely identified with “a positive and a negative field point pair between 4 Angstroms and 10 Angstroms apart”. The value stored in the element depends on the size of the field points, and is a real number.

Using such an approach each element of the string (or vector) corresponds to a (type 1, type 2, quantized distance) triplet (e.g. element 47 could stand for “negative, positive, 4-10 Angstroms apart”). Consequently, strings of a fixed, known length can be used.

Thus, in one embodiment which uses this approach the length of the string is set to the number of possible (type 1, type 2, distance) triplets; the deterministic function is set to the identity function (i.e. there is a one-to-one correspondence of the numeric identifier to a single number between 1 and n for a string of length n); and a real number value depending upon the size of the two field points is added to the bin (rather than the bin just being incremented or the bit being set, as described in relation to earlier examples).
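A sketch of this real-valued variant follows. The distance ranges in `UPPER_BOUNDS`, the layout of the vector and all function names are hypothetical; element positions are zero-based list indices here rather than the 1 to n numbering used in the text.

```python
import math
from bisect import bisect_left
from itertools import combinations, combinations_with_replacement

FIELD_TYPES = ("negative", "positive", "surface", "scaffold")
TYPE_PAIRS = list(combinations_with_replacement(range(len(FIELD_TYPES)), 2))
UPPER_BOUNDS = (4.0, 10.0, 20.0)      # hypothetical distance ranges, Angstroms
N_DIST = len(UPPER_BOUNDS) + 1        # last range is "above 20"
VECTOR_LENGTH = len(TYPE_PAIRS) * N_DIST  # one element per triplet


def element_index(type_a: str, type_b: str, distance: float) -> int:
    """Map a (type 1, type 2, distance range) triplet to its own element."""
    pair = tuple(sorted((FIELD_TYPES.index(type_a), FIELD_TYPES.index(type_b))))
    return TYPE_PAIRS.index(pair) * N_DIST + bisect_left(UPPER_BOUNDS, distance)


def field_vector(points, combine=lambda e1, e2: e1 * e2):
    """Build the vector from (field_type, energy, (x, y, z)) tuples.

    `combine` turns the two energies into one real number; the default is
    the product, and a sum can be passed instead.
    """
    vector = [0.0] * VECTOR_LENGTH
    for (ta, ea, pa), (tb, eb, pb) in combinations(points, 2):
        vector[element_index(ta, tb, math.dist(pa, pb))] += combine(ea, eb)
    return vector


# The negative/positive pair from the text, placed 6.7 Angstroms apart.
points = [("negative", 6.23, (0.0, 0.0, 0.0)), ("positive", 2.09, (6.7, 0.0, 0.0))]
by_product = field_vector(points)
by_sum = field_vector(points, combine=lambda e1, e2: e1 + e2)
```

Because each triplet owns exactly one element, no pseudo random number generator is needed and the vector length is fixed by the number of type pairs and distance ranges.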

Indexes in the form of bit strings representing field point representations are stored in a database to allow rapid searching of field point representations. The following section describes some techniques used to compare a search query with indexes in the database.

III. Searching the Database

Since a known index in the form of a bit string is used in particular embodiments of the present invention, known bit string manipulation techniques can be used, such as testing for substructures, testing for exact matches, Tanimoto coefficient testing, Euclidean distance testing, Tversky testing and Kulczynski testing.

In one embodiment a searching mechanism is used to search the database. The searching mechanism may be implemented on a computer system as software, firmware or hardware, although in a particular embodiment it is implemented as software.

Suitably, the searching mechanism is configured to:

(i) compare a query index with an index of a field point representation for a record in the database;

(ii) identify the record as a hit if the comparison satisfies a search criterion; and

(iii) repeat (i) and (ii) for a plurality of records.

The plurality of records can be all of the records in the database or a subset of these.
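Steps (i) to (iii) reduce to a loop over the stored indexes. In this sketch each index is held as a Python integer whose binary digits are the bit string, and the search criterion is passed in as a function; both choices, and the function name `search`, are assumptions made for illustration.

```python
from typing import Callable, Iterable


def search(
    query_index: int,
    records: Iterable[tuple[int, int]],
    criterion: Callable[[int, int], bool],
) -> list[int]:
    """Return the IDs of records whose index satisfies the search criterion.

    `records` yields (record ID, index) pairs.
    """
    hits = []
    for record_id, record_index in records:       # (iii) repeat per record
        if criterion(query_index, record_index):  # (i) compare
            hits.append(record_id)                # (ii) identify as a hit
    return hits


# Hypothetical records and an exact-match criterion.
records = [(1, 0b1010), (2, 0b0110), (3, 0b1010)]
exact_hits = search(0b1010, records, lambda q, r: q == r)
```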

The searching mechanism can be further configured to:

receive a search query identifying a field point representation; and

form the query index by generating an index of the field point representation identified by the search query.

In one embodiment the searching mechanism is configured to form the query index by using the indexing mechanism to generate an index of the field point representation identified by the search query. Suitably, the searching mechanism is configured to generate the query index as a bit string.

The processes involved in the execution of field pattern searching in a particular example are given below.

1. Using a suitable interface, for example a GUI, a user selects:

a. The field pattern to be used as the query. This may be from:

    i. A conformation's field pattern already registered to the database.
    ii. An external file in the XED format (the system could be developed to allow external files in other formats to be used).

b. The comparison type to be used for the search.

c. If a similarity comparison is chosen, the user is required to provide the maximum and minimum similarity values defining the range that will be regarded as a hit during the comparison.

d. The precision level at which the search should be carried out.

2. On submitting the query the interface passes information to the database.

3. The database then:

a. Creates a fingerprint (bit string representation of the field pattern) for the query at the required precision level.

b. Creates a temporary table to hold the results.

c. Searches all of the fingerprint indexes (at the requested precision level) stored in the database.

d. Writes information to the temporary results table regarding any hit.

e. When the search is complete the database informs the interface in which table the results are held.

f. The interface then selects the information from the table and displays it to the user.

g. Once the user has finished viewing the results the interface tells the database to delete the table holding the results.

FIG. 2 is a flow diagram illustrating the fingerprint searching for the particular example.

In a particular embodiment the searching mechanism is configured to use a true/false matching technique to compare a search query with a record. True/false matching techniques that can be used in the current embodiment include an exact pattern technique, a sub pattern technique and a super pattern technique.
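The three true/false techniques can be sketched with bitwise operations on indexes held as integers. The text does not define sub pattern and super pattern matching, so the definitions below are assumptions: a sub pattern match is taken to mean that every bit set in the query is also set in the record, and a super pattern match the reverse.

```python
def exact_pattern(query: int, record: int) -> bool:
    """True if the two bit strings are identical."""
    return query == record


def sub_pattern(query: int, record: int) -> bool:
    """Assumed meaning: every bit set in the query is set in the record."""
    return (query & record) == query


def super_pattern(query: int, record: int) -> bool:
    """Assumed meaning: every bit set in the record is set in the query."""
    return (query & record) == record
```

Because different properties can set the same bit, a bitwise match of this kind can only screen candidates; it does not by itself prove that the underlying field patterns are related.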

The searching mechanism can also be configured to use a similarity measuring technique to compare the search query with the record. In one embodiment, similarity measuring techniques that can be used include a Euclidean distance technique, a streetcar distance technique, a sub pattern similarity technique, a super pattern similarity technique, a Tanimoto similarity technique, a Dice technique, and a Tversky similarity technique. A Kulczynski technique is used in a particular embodiment.
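Several of these measures can be written in terms of three counts: the bits set in the query (a), the bits set in the record (b) and the bits set in both (c). The sketch below uses the commonly published binary-fingerprint forms of the Tanimoto, Dice, Tversky, Kulczynski and Euclidean measures; whether the described system uses exactly these forms is not stated, and the handling of empty bit strings is an assumption.

```python
import math


def counts(query: int, record: int) -> tuple[int, int, int]:
    """Return (a, b, c): bits set in query, in record, and in both."""
    ones = lambda x: bin(x).count("1")
    return ones(query), ones(record), ones(query & record)


def tanimoto(query: int, record: int) -> float:
    a, b, c = counts(query, record)
    return c / (a + b - c) if a + b - c else 1.0  # two empty strings: identical


def dice(query: int, record: int) -> float:
    a, b, c = counts(query, record)
    return 2 * c / (a + b) if a + b else 1.0


def tversky(query: int, record: int, alpha: float, beta: float) -> float:
    a, b, c = counts(query, record)
    denominator = alpha * (a - c) + beta * (b - c) + c
    return c / denominator if denominator else 1.0


def kulczynski(query: int, record: int) -> float:
    a, b, c = counts(query, record)
    return 0.5 * (c / a + c / b) if a and b else 0.0


def euclidean_distance(query: int, record: int) -> float:
    a, b, c = counts(query, record)
    return math.sqrt(a + b - 2 * c)  # square root of the number of differing bits
```

With alpha and beta both set to 1 the Tversky measure reduces to the Tanimoto measure, which gives a quick consistency check.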

The searching mechanism is configured to identify a record as a hit dependent on a similarity measure produced by the similarity measuring technique being in a range from a minimum similarity value to a maximum similarity value.

In a particular embodiment the searching mechanism is configured to search by precision level. Suitably, this is done by generating an index of the field point representation at a required precision level to form the query index and comparing the query index with an index at the same precision level of a field point representation for a record in the database.

A user can submit a search query through a user interface. The searching mechanism stores the hits in a results table which is used to display the results to the user through the interface. In embodiments of the invention any suitable user interface, for example a graphical user interface (GUI), may be provided to enable a user to interact with the database.

Finally, it is noted that, although it is possible to apply the search method with a fixed similarity criterion (e.g. ‘return all records with a Tanimoto similarity > 0.8’), it is usually preferable to calculate the similarity value for all records in the database, use these values to rank the database, and then output the top, i.e. most similar, N compounds.
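Both styles of search can be sketched side by side: a fixed similarity range, and a ranking that returns the N most similar records. The Tanimoto measure is used as the example similarity, indexes are held as integers, and the function names are hypothetical.

```python
import heapq


def tanimoto(query: int, record: int) -> float:
    a = bin(query).count("1")
    b = bin(record).count("1")
    c = bin(query & record).count("1")
    return c / (a + b - c) if a + b - c else 1.0


def hits_in_range(query, records, minimum, maximum):
    """Return IDs of records whose similarity lies in [minimum, maximum]."""
    return [rid for rid, idx in records if minimum <= tanimoto(query, idx) <= maximum]


def top_n(query, records, n):
    """Return the n most similar records as (similarity, record ID) pairs."""
    scored = ((tanimoto(query, idx), rid) for rid, idx in records)
    return heapq.nlargest(n, scored)
```

Ranking avoids having to guess a suitable threshold in advance, at the cost of computing a similarity value for every record searched.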

IV. Database Overview

FIG. 3 shows an overview of the database. In the illustrated embodiment the database 100 is an Oracle database (version 8.1.7 or greater). A separate user application 102 provides the GUI which is configured to enable a user to interface with data stored in the database. Files 104 containing structure data, including data representing field point representations, are also illustrated.

Import operations (illustrated as 1 in FIG. 3) include importing data from the files 104 to the user application 102, transferring data from the user application 102 to the database 100 and transferring data from the files 104 directly to the database 100. Export operations (illustrated as 2) include transferring data from the database 100 to the user application 102 or to files 104. Searching (illustrated as 3) can be performed using the user application 102, optionally using data from a file 104. Browsing the database (illustrated as 4) can be performed using the user interface (e.g. a GUI) of user application 102.

The database comprises tables 106 comprising data 108 and views 110 for viewing data split across more than one table. The database also comprises packages 112 comprising public functions and procedures used by the user application and private functions and procedures used internally to execute particular tasks (for example to execute searching). The database also comprises sequences 114 for providing consecutive numbering for items in the database.

Referring back to the indexing mechanism and the searching mechanism, these are implemented as software functions/procedures in the database of the illustrated embodiment.

Creation and maintenance of the features within the database are achieved using conventional techniques and methods supported by the Oracle database environment. In the illustrated embodiment all procedures and functions have an SQL interface and the code executed by the procedure or function may be implemented in SQL or Java.

It will be appreciated that in the illustrated embodiment much of the functionality of the system is embedded within the database itself, for example for storing data, retrieving data and searching data. Communication between the user interface (e.g. a GUI)/user application 102 and database 100 is achieved using conventional protocols, for example ADO, although any suitable protocol can be used.

The user application 102 is written in Visual Basic and may be run in any standard Windows PC environment. For the most part the user interface (e.g. a GUI) communicates with the database through the packages embedded within the database. The user interface can also directly access data from the tables for display purposes, such as record browsing.

The user interface enables a user to input data to the database, to output data from the database, to delete data from the database, to update data in the database, to browse the database, to search the database, and to display search results.

V. Database Structure

This section details the physical structure of the database schema of a particular embodiment. An overview of the tables of the database schema is given in FIG. 4.

The database schema is centred on the Objects table. This holds the top-level information for each molecule registered. Each molecule has a single entry in the Objects table and is uniquely identified by a specific ID allocated at registration. This ID is used throughout the other tables in the schema to identify items related to that molecule. The Structures table holds all of the structure information (an entry per conformation) for each molecule. This allows the structure of any conformation to be retrieved, interpreted and displayed by a suitable application connecting to the database. In the particular embodiment the structure information is held within the table as a Binary Large Object (BLOB) data-type.

General properties for each molecule are held in the Objects table, whilst properties specific to a conformation are each held in the Structures table.

When a molecule is registered to the database a Type and Source must be supplied. These must match allowed items for the Type and Source defined in the Type_Dict and Source_Dict tables.

The Source identifier allows the association of a molecule and hence its conformations with a particular source. The user may give any name to a source that has meaning to them. This could be used to track companies or projects within the database, for example MDR, HIV, or MayBridge.

The Type identifier allows the association of a molecule and hence its conformations with a particular type. The user may give any name to a type that has meaning to them. This could be used to track different entity types, for example Molecule, Fragment, Building Block or Field Template.

Any number of sources and types can be created in the database; however, only one source and one type can be associated with a given molecule and its conformations.

The chemical structures stored in the Structures table are a complete representation of the information supplied at registration time, i.e. chemical structure and field point representation (field pattern). However they are not used for searching. The schema provides a separate Fieldprints table to hold data generated at registration time which is more applicable to field searching.

V.1 Tables

The tables of the schema will be described in turn.

Objects Table

The Objects table holds the top-level information for each entry in the database. One entry per molecule will exist in this table.

Where data integrity is to be maintained, constraints have been created, i.e. it is not possible to register an entry to the table with an ID that already exists, or with a TypeID or SourceID that does not exist in the appropriate table.

Table Structure

FIELD | DATA TYPE | NULL | DESCRIPTION | CONSTRAINT | CONSTRAINT LINK
OBJECTID | NUMBER (11) | N | Internal ID created from a sequence | PKEY |
NAME | VARCHAR2 (255) | N | Supplied data from import file | |
DESCRIPTION | VARCHAR2 (255) | Y | Supplied data from import file | |
TYPEID | NUMBER (11) | N | Supplied data from list of allowed types | FKEY | Type_Dict:Typeid
SOURCEID | NUMBER (11) | Y | Supplied data from list of allowed dictionary sources | FKEY | Source_Dict:Sourceid
MOLFORMULA | VARCHAR (255) | Y | Calculated from the structure | |
MOLWIEGHT | NUMBER (11, 4) | Y | Calculated from the structure | |
NUMSTRUCTURES | NUMBER (11) | Y | Calculated from the number of entries for this molecule registered in the Structures table | |
MAXENERGY | NUMBER (11, 4) | Y | Calculated from the max energy of the conformations registered for this molecule in the Structures table | |
MINENERGY | NUMBER (11, 4) | Y | Calculated from the min energy of the conformations registered for this molecule in the Structures table | |
IMPORTFILE | VARCHAR2 (255) | Y | The file the molecule was imported from | |
TIMESTAMP | NUMBER (15) | N | System assigned registration date | |

Structures Table

The Structures table holds data about each and every conformation loaded into the database. A sequence number is assigned internally to differentiate the conformers for a particular molecule.

Table Structure

FIELD | DATA TYPE | NULL | DESCRIPTION | CONSTRAINT | CONSTRAINT LINK
OBJECTID | NUMBER (11) | N | | FKEY | Objects:Objectid
STRUCTURESEQNO | NUMBER (11) | N | The particular number of conformation stored for this molecule | |
STRUCTURE | BLOB | N | Binary storage of the structure from the import file | |
ENERGY | NUMBER (11) | Y | Supplied data from import file | |
TIMESTAMP | NUMBER (15) | N | System assigned registration date | |

Fieldprints Table

The Fieldprints table holds the data created for searching of the field point representation or field pattern. In the particular embodiment this data is created at various precision levels. Each precision level has an entry within the table. In the particular embodiment four precision levels are used.

A fingerprint is created for each and every conformation stored in the database from its field point representation. All fingerprints of the same precision level are combined into a single blob for rapid searching.

Table Structure

FIELD | DATA TYPE | NULL | DESCRIPTION | CONSTRAINT | CONSTRAINT LINK
IDXLEVEL | NUMBER (11) | N | The precision level at which the index was created | PKEY |
IDXPRINT | BLOB | N | The blob containing data at specified precision for all structures containing fields | |
TIMESTAMP | NUMBER (15) | N | System assigned registration date | |

Type_Dict Table

This table stores all of the dictionary items that may be assigned to the molecule being registered.

Table Structure

FIELD | DATA TYPE | NULL | DESCRIPTION | CONSTRAINT | CONSTRAINT LINK
TYPEID | NUMBER (11) | N | Internal ID created from a sequence | PKEY, UNIQUE |
NAME | VARCHAR2 (255) | N | User supplied data | |
DESCRIPTION | VARCHAR2 (255) | Y | User supplied data | |
TIMESTAMP | NUMBER (15) | N | System assigned registration date | |

Source_Dict Table

This table stores all of the dictionary items that may be assigned to the molecule being registered.

Table Structure

FIELD | DATA TYPE | NULL | DESCRIPTION | CONSTRAINT | CONSTRAINT LINK
SOURCEID | NUMBER (11) | N | Internal ID created from a sequence | PKEY, UNIQUE |
NAME | VARCHAR2 (255) | N | User supplied data | |
DESCRIPTION | VARCHAR2 (255) | Y | User supplied data | |
TIMESTAMP | NUMBER (15) | N | System assigned registration date | |

Results_(X) Table

This table stores the results obtained from any fingerprint search andis transitional.

Each fingerprint search has its own results table, which is identified by the _(X) part of the table name. The X is assigned internally as the next number from a sequence.

This table is usually deleted when no longer required by the user application.

Table Structure

FIELD | DATA TYPE | NULL | DESCRIPTION | CONSTRAINT | CONSTRAINT LINK
OBJECTID | NUMBER (11) | N | | |
OBJECTIDSEQNO | NUMBER (11) | Y | | |
SIMILARITY | NUMBER (6) | Y | | |

Any suitable database and database schema may be used to implement the present invention.
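As one illustration of a different database being used, the integrity constraints described for the Objects table can be sketched with SQLite through the Python standard library `sqlite3` module. This is not the Oracle schema of the illustrated embodiment: the column set is reduced, the SQLite column types are substitutions, and the composite primary key on the Structures table is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Type_Dict (
    TYPEID      INTEGER PRIMARY KEY,
    NAME        TEXT NOT NULL,
    DESCRIPTION TEXT
);
CREATE TABLE Objects (
    OBJECTID INTEGER PRIMARY KEY,
    NAME     TEXT NOT NULL,
    TYPEID   INTEGER NOT NULL REFERENCES Type_Dict (TYPEID)
);
CREATE TABLE Structures (
    OBJECTID       INTEGER NOT NULL REFERENCES Objects (OBJECTID),
    STRUCTURESEQNO INTEGER NOT NULL,
    STRUCTURE      BLOB NOT NULL,
    ENERGY         REAL,
    PRIMARY KEY (OBJECTID, STRUCTURESEQNO)  -- assumed key
);
""")
conn.execute("INSERT INTO Type_Dict (TYPEID, NAME) VALUES (1, 'Molecule')")
conn.execute("INSERT INTO Objects (OBJECTID, NAME, TYPEID) VALUES (1, 'example', 1)")


def try_insert(sql: str) -> bool:
    """Run an INSERT and report whether the constraints allowed it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A duplicate ID and an unknown TypeID are both rejected.
duplicate_ok = try_insert("INSERT INTO Objects VALUES (1, 'again', 1)")
unknown_type_ok = try_insert("INSERT INTO Objects VALUES (2, 'other', 99)")
```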

V.2 Database Packages

The use of functions and procedures within the Oracle database environment allows complex tasks to be completed with a single call to the database. They also provide a way of masking the complexity of the database from a user or application, i.e. the user does not have to know the internal detail of the database schema to register various items of information; they need only supply the data to a procedure or function, which handles the details.

Functions and procedures can also be amalgamated into packages. In the present implementation, the call interface for all functions and procedures is declared using SQL since this is the language of the database environment. However the executable code may be written in SQL, C, Java, or a mixture of these languages.

The use of packages allows procedures and functions to be specified as public and private. Calls made externally to the database may only use public methods.

The database environment of the present embodiment has three packages. One package (PACK_CBMD_REG) is concerned with registration of molecules and their conformations along with all of the information (such as the fingerprints) into the database tables. A second package (PACK_CBMD_CHEM) is concerned with searching the fingerprints (the indexes). A third package (PACK_CBMD_UTILS) contains general utilities used by the other two packages.

VI. Computer System

FIG. 5 shows a schematic and simplified representation of a computersystem 200. The computer system 200 comprises various data processingresources such as a processor (CPU) 230 coupled to a bus structure 238.Also connected to the bus structure 238 are further data processingresources such as read only memory 232 and random access memory 234. Adisplay adapter 236 connects a display device 218 having screen 220 tothe bus structure 238. One or more user-input device adapters 240connect the user-input devices, including the keyboard 222 and mouse 224to the bus structure 238. An adapter 241 for the connection of theprinter 221 may also be provided. One or more media drive adapters 242can be provided for connecting the media drives, for example the opticaldisk drive 214, the floppy disk drive 216 and hard disk drive 219, tothe bus structure 238. One or more telecommunications adapters 244 canbe provided for connecting the computer system to one or more networksor to other computer systems or devices.

In operation the processor 230 runs computer software by executingcomputer program instructions and operating on data that may be storedin one or more of the read only memory 232, random access memory 234 thehard disk drive 219, a floppy disk in the floppy disk drive 216 and anoptical disc, for example a compact disc (CD) or digital versatile disc(DVD), in the optical disc drive or dynamically loaded via adapter 244.The results of the processing performed may be displayed to a user viathe display adapter 236 and display device 218. User inputs forcontrolling the operation of the computer system 200 may be received viathe user-input device adapters 240 from the user-input devices.

Computer software comprising data files and executable files or computerprograms for implementing various functions or conveying variousinformation can be written in a variety of different computer languagesand can be supplied on carrier media. Software comprising a program orprogram element may be supplied on one or more CDs, DVDs and/or floppydisks and then stored on a hard disk, for example. Software may also beembodied as an electronic signal supplied on a telecommunicationsmedium, for example over a telecommunications network. Examples ofsuitable carrier media include one or more selected from: a radiofrequency signal, an optical signal, an electronic signal, a magneticdisk or tape, solid state memory, an optical disk, a magneto-opticaldisk, a compact disk and a digital versatile disk.

It will be appreciated that the architecture of a computer system could vary considerably and FIG. 5 is only one example.

In the present example, computer software configured to provide the database is stored on the computer system.
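As a minimal sketch of how such software might generate an index of a field point representation, the following Python fragment derives a numeric identifier from each characteristic, maps it deterministically to numbers in the range 1 to n, and increments the corresponding bins. Everything specific here is an assumption made for illustration: the function names, the `(field, energy, (x, y, z))` tuple layout, the field labels, the use of CRC-32 to form the numeric identifier, the seeded pseudo-random number generator, the equal-width ranges, and holding the index as a list of integer bins rather than as a literal string.

```python
import random
import zlib
from collections import Counter
from itertools import combinations
from math import dist, floor


def to_bin(value, width):
    """Assign a measurement to an equal-width range and return the range number."""
    return floor(value / width)


def numeric_identifier(characteristic):
    """Derive a numeric identifier from a characteristic (assumed: CRC-32 of its repr)."""
    return zlib.crc32(repr(characteristic).encode("utf-8"))


def bin_positions(identifier, n, k):
    """Deterministically map an identifier to k numbers in the range 1..n."""
    rng = random.Random(identifier)  # seeded PRNG: same identifier, same numbers
    return [rng.randint(1, n) for _ in range(k)]


def generate_index(field_points, n=64, k=2, energy_width=1.0, distance_width=1.0):
    """Build an index of length n from (field, energy, (x, y, z)) tuples."""
    characteristics = []
    # Number of field points of each field type.
    for field, count in Counter(fp[0] for fp in field_points).items():
        characteristics.append(("count", field, count))
    # Field type and binned energy of each field point.
    for field, energy, _ in field_points:
        characteristics.append(("point", field, to_bin(energy, energy_width)))
    # Binned energies of, and binned distance between, each pair of field points.
    for (_, e1, p1), (_, e2, p2) in combinations(field_points, 2):
        energies = tuple(sorted((to_bin(e1, energy_width), to_bin(e2, energy_width))))
        characteristics.append(("pair", energies, to_bin(dist(p1, p2), distance_width)))

    index = [0] * n
    for characteristic in characteristics:
        identifier = numeric_identifier(characteristic)
        for position in bin_positions(identifier, n, k):
            index[position - 1] += 1  # positions are 1-based, the list is 0-based
    return index


# Hypothetical field points, for illustration only.
points = [
    ("positive", 4.2, (0.0, 0.0, 0.0)),
    ("negative", -3.7, (3.5, 0.0, 0.0)),
    ("negative", -1.1, (0.0, 2.5, 0.0)),
]
fine = generate_index(points, energy_width=0.5, distance_width=0.5)
coarse = generate_index(points, energy_width=2.0, distance_width=2.0)
print(fine)
print(coarse)
```

Because only counts, energies and inter-point distances feed the identifiers, the resulting index in this sketch does not depend on the order of the field points or on a translation of their coordinates. Calling the function with different range widths, as in the last lines, is one way to obtain multiple indexes at different precision levels.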

REFERENCES

-   [1] ‘Substructure search of chemical structure files’; pp 157-181, and ‘Chemical structure search systems and services’; pp 182-202, in Communication, Storage and Retrieval of Chemical Information, Ash J., Chubb P., Welford S., Willet P. (Eds), Ellis Horwood, Chichester, 1985.
-   [2] Barnard J. M.; ‘Structure representation and searching’; pp 9-56, in Chemical Structure Systems, Ash J. E., Warr W. A., Willet P. (Eds), Ellis Horwood, Chichester, 1991.
-   [3] Mooers C. N.; ‘Zatocoding applied to mechanical organization of knowledge’; Amer. Doc., 2, 20-32, January 1951.
-   [4] Mooers C. N.; ‘Zatocoding and developments in information retrieval’; ASLIB Proceedings, 8(1), 3-22, February 1956.
-   [5] J G Vinter: Journal of Computer-Aided Molecular Design: volume 8 (1994) pages 653-668.
-   [6] J G Vinter and K I Trollope: Journal of Computer-Aided Molecular Design: volume 9 (1995) pages 297-307.

1. A computer system comprising a database having a plurality of records, wherein each record comprises a field point representation representing field extrema for a conformation of a chemical structure, each record having an index of the field point representation, wherein the index is a searchable representation of the field point representation and the index is a string.
2. The computer system of claim 1, wherein the database includes records for multiple conformations of the same chemical structure.
3. The computer system of claim 1, wherein each record further comprises a structural representation of the chemical structure.
4. (canceled)
5. (canceled)
6. The computer system of claim 1, each record having multiple indexes of the field point representation, wherein the multiple indexes are representations of the field point representation at different precision levels.
7. The computer system of claim 1, comprising an indexing mechanism for generating an index of a field point representation.
8. The computer system of claim 7, wherein an index is a string of length n and the indexing mechanism is configured to: (i) generate a numeric identifier from a characteristic of the field point representation; (ii) generate one or more numbers in a range from 1 to n in dependence on the numeric identifier; (iii) increment the bins in the string that correspond to the one or more numbers; and (iv) optionally repeat (i) to (iii) for another characteristic of the field point representation.
9. The computer system of claim 8, wherein a characteristic of the field point representation includes one or more of: the number of field points of a particular field of the field point representation; the particular field and energy of a field point in the field point representation; and the respective energies of and distance between a field point pairing in the field point representation.
10. The computer system of claim 8, wherein the indexing mechanism is configured to take a measurement of a characteristic of the field point representation to generate the numeric identifier.
11. The computer system of claim 10, wherein the indexing mechanism is configured to take a measurement of the characteristic of the field point representation at different levels of precision to generate corresponding multiple indexes which represent the field point representation at different precision levels.
12. The computer system of claim 8, wherein the indexing mechanism is configured to: (i) define a plurality of ranges of possible measurement values; (ii) take a measurement of a characteristic of the field point representation to produce a measurement value; (iii) assign the measurement value to a range if the measurement value is within the range; (iv) optionally repeat (ii) and (iii); and (v) use the number of measurement values assigned to the range to generate the numeric identifier.
13. The computer system of claim 12, wherein the indexing mechanism is configured to define ranges of equal width across all ranges.
14. The computer system of claim 12, wherein the indexing mechanism is configured to define a range for smaller measurement values with a narrower width than a range for larger measurement values.
15. The computer system of claim 12, wherein the indexing mechanism is configured to generate multiple indexes by defining ranges of different widths for different precision levels.
16. The computer system of claim 8, wherein the indexing mechanism is configured to generate one or more numbers in a range from 1 to n in dependence on the numeric identifier by using a deterministic function.
17. The computer system of claim 16, wherein the deterministic function is a pseudo-random number generator or a hash function.
18. The computer system of claim 8, wherein the bins in the string take real number values.
19. The computer system of claim 18, wherein the real number value is generated from the energies of a pair of field points in the field point representation.
20. The computer system of claim 1, comprising a searching mechanism for searching the database.
21. The computer system of claim 20, wherein the searching mechanism is configured to: (i) compare a query index with an index of a field point representation for a record in the database; (ii) identify the record as a hit if the comparison satisfies a search criterion; and (iii) repeat (i) and (ii) for a plurality of records.
22. The computer system of claim 21, wherein the searching mechanism is further configured to: receive a search query identifying a field point representation; form the query index by generating an index of the field point representation identified by the search query.
23. The computer system of claim 20, wherein the search mechanism is configured to search by precision level.
24. A database for implementation on a computer system, the database configured to support a plurality of records, each record comprising a field point representation representing field extrema for a conformation of a chemical structure, the database further configured to support each record having an index of the field point representation, wherein the index is a searchable representation of the field point representation and the index is a string.
25. (canceled)
26. (canceled)
27. The database of claim 24, configured to support each record having multiple indexes of the field point representation, wherein the multiple indexes are representations of the field point representation at different precision levels.
28. The database of claim 24, comprising an indexing mechanism for generating an index of a field point representation.
29. The database of claim 24, comprising a searching mechanism for searching the database.
30. Computer software configured to provide the database of claim 24.
31. A carrier medium carrying computer software configured to provide the database of claim.
32. A method of generating an index of a field point representation representing field extrema for a conformation of a chemical structure, wherein the index is a string with n elements, the method comprising: (i) generating a numeric identifier from a characteristic of the field point representation; (ii) generating one or more numbers in a range from 1 to n in dependence on the numeric identifier; (iii) incrementing the string elements that correspond to the one or more numbers; and (iv) optionally repeating (i) to (iii) for another characteristic of the field point representation.
33. The method of claim 32, comprising taking a measurement of a characteristic of the field point representation to generate the numeric identifier.
34. The method of claim 33, comprising taking the measurement of the characteristic of the field point representations at different levels of precision to generate corresponding multiple indexes which represent the field point representation at different precision levels.
35. The method of claim 32, comprising (i) defining a plurality of ranges of possible measurement values; (ii) taking a measurement of a characteristic of the field point representation to produce a measurement value; (iii) assigning the measurement value to a range if the measurement value is within the range; (iv) optionally repeating (ii) and (iii); and (v) using the number of measurement values assigned to a range to generate the numeric identifier.
36. The method of claim 35, comprising defining ranges of equal width across all ranges.
37. The method of claim 35, comprising defining a range for smaller measurement values with a narrower width than a range for larger measurement values.
38. The method of claim 35, comprising generating multiple indexes by defining ranges of different widths for different precision levels.
39. The method of claim 32, comprising using a deterministic function to generate one or more numbers in a range from 1 to n in dependence on the numeric identifier.
40. The method of claim 39, wherein the deterministic function is a pseudo-random number generator or a hash function.
41. A method of searching a database having a plurality of records, each record comprising a field point representation representing field extrema for a conformation of a chemical structure and having an index of the field point representation wherein the index is a string, the method comprising: (i) comparing a query index with an index of a field point representation for a record in the database; (ii) identifying the record as a hit if the comparison satisfies a search criterion; (iii) repeating (i) and (ii) for a plurality of records; and (iv) outputting a representation of the records identified as a hit.
42. The method of claim 41, further comprising: receiving a search query identifying a field point representation; and forming the query index by generating an index of the field point representation identified by the search query.
43. The method of claim 41, the method further comprising searching by precision level.
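
By way of illustration only, the searching steps (i) to (iv) recited above might be sketched in Python as follows. The search criterion is not fixed by the wording above, so this sketch assumes one: a Tanimoto-style ratio of shared to total bin counts compared against a threshold. The function names, the record identifiers, the example index values and the threshold of 0.7 are all hypothetical.

```python
def similarity(query_index, record_index):
    """Assumed criterion: sum of per-bin minima divided by sum of per-bin maxima."""
    shared = sum(min(q, r) for q, r in zip(query_index, record_index))
    total = sum(max(q, r) for q, r in zip(query_index, record_index))
    return shared / total if total else 1.0  # two empty indexes count as identical


def search(query_index, records, threshold=0.7):
    """Compare the query index with every record index and return the hits."""
    hits = []
    for record_id, record_index in records.items():
        score = similarity(query_index, record_index)  # step (i): compare
        if score >= threshold:                         # step (ii): apply criterion
            hits.append((record_id, score))
    # Step (iv): output the hits, best match first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical records: identifier -> index of a conformation's field points.
records = {
    "conformation_a": [2, 0, 1, 3],
    "conformation_b": [0, 4, 0, 0],
    "conformation_c": [2, 1, 1, 2],
}
hits = search([2, 0, 1, 3], records)
for record_id, score in hits:
    print(record_id, round(score, 3))
```

A search by precision level could, under the same assumptions, run this comparison first against coarse indexes and then re-score only the surviving records against finer indexes.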